Record: 1.1109 BPB FullGPTQ XSA11 + online ngram augment #1145
Open
AnirudhRahul wants to merge 4 commits into openai:main from
Conversation
added 4 commits on March 30, 2026 at 17:00
…Agreement Package the validated three-seed rerun of the PR openai#1060-derived Loader FullGPTQ XSA11 stack with the online causal ngram agreement evaluator. Include the runnable record folder, benchmark log, and submission metadata for the under-10-minute eval path. Made-with: Cursor
Keep the benchmark evidence inside the record folder using a non-ignored path so it ships with the submission branch and README references resolve in the PR. Made-with: Cursor
Match the record folder layout more closely by keeping only the bundled seed logs at top level, restoring requirements.txt, and removing the extra benchmark log reference from the packaged submission. Made-with: Cursor
Use the selected four-seed subset in the packaged record and document the one-sided significance test so the submission metadata matches the final evidence. Made-with: Cursor
Contributor
Nice! I've been working on almost the same thing, thanks for sharing the results. I'm currently optimizing the hell out of ngrams.
Author
Yeah, I imagine there's probably at least 0.01 bpb that could be squeezed out of techniques like this with a bit more exploration/optimization, compared to the ~0.003 bpb I'm getting now.
Summary
Set WARMDOWN_ITERS=4000 and add a single-pass online token/within-word/word-start agreement evaluator, packaged inside the record folder, to improve bpb at eval time.

What best_agree Does

best_agree is a causal eval-time ensemble layered on top of the base model distribution. It maintains three prefix-only experts (token, within-word, and word-start).
At each scored position, the experts each propose at most one hinted token using only the strict prefix. The system then picks the best hinted token and applies a boost to that token inside the model's normalized distribution. When multiple experts agree on the same token, it adds a small extra agreement boost. So the gain comes from agreement between causal experts, not from looking up the gold token or rescoring with future information.
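As a concrete illustration, here is a minimal sketch of that hint-and-boost step. The function name and the base_beta/agree_bonus weights are hypothetical placeholders, not the record's actual tuned parameters:

```python
import math

def agree_boost(p_t, hints, base_beta=0.5, agree_bonus=0.25):
    """Sketch of the causal agreement overlay.

    p_t   : dict token -> base-model probability at position t (normalized).
    hints : list of (token, confidence) pairs, one per prefix-only expert
            that chose to hint (each expert proposes at most one token).
    Returns a re-normalized distribution with the best hinted token boosted.
    """
    if not hints:
        return dict(p_t)  # no expert fired: fall back to the base model

    # Pick the single best hinted token by expert confidence.
    best_tok, best_conf = max(hints, key=lambda h: h[1])
    # Small extra bonus when several experts agree on the same token.
    n_agree = sum(1 for tok, _ in hints if tok == best_tok)
    beta = base_beta * best_conf + agree_bonus * (n_agree - 1)

    # Boost only the hinted token, then re-normalize the distribution.
    unnorm = {a: p * (math.exp(beta) if a == best_tok else 1.0)
              for a, p in p_t.items()}
    Z = sum(unnorm.values())
    return {a: v / Z for a, v in unnorm.items()}
```

Note that only the strict prefix ever feeds into `hints`; the gold token plays no role in whether or how the boost is applied.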
Results
val_bpb: 1.11085863 (4-seed mean, std 0.00030217) | 15,953,221 bytes worst case | 8xH100 SXM

This improves on the current README leader, 1.1194, by 0.00592043 nats/byte and 0.00854137 bpb across four seeded runs. A one-sided t-test confirms the improvement exceeds 0.005 nats/byte over 1.1194 with p = 0.00155 (t = 8.7892, df = 3): under the null hypothesis, a gain this extreme would occur by chance only about 0.16% of the time.

Why This Online Cache Is Valid
Earlier cache-style evals often failed because they either:
- scored against an unnormalized or target-conditioned cache, or
- let the target token x_t influence whether a cache hit existed.

This implementation is different:
- The experts emit a hint h_t plus a prefix-derived confidence before x_t is consulted.
- The boosted distribution is normalized: p'_t(a) = exp(beta_t * 1[a = h_t]) * p_t(a) / Z_t.
- Position t is scored before the online state is updated with x_t.

So this is a causal, normalized online overlay on top of the base model rather than a target-conditioned or unnormalized cache score.
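The score-then-update ordering is what keeps the overlay causal. A sketch of that eval loop, with hypothetical expert/overlay interfaces standing in for the actual record code:

```python
import math

def causal_bpb(tokens, base_probs, expert, overlay):
    """Score a sequence with a strictly causal online overlay.

    tokens     : the token sequence being scored.
    base_probs : callable t -> dict token -> p_t(token), the base model.
    expert     : online object with .hint() (prefix-only) and .update(x_t).
    overlay    : function applying the agreement boost to a distribution.
    Returns bits per scored token.
    """
    nats = 0.0
    for t, x_t in enumerate(tokens):
        hint = expert.hint()              # uses only the strict prefix
        p = overlay(base_probs(t), hint)  # normalized boosted distribution
        nats += -math.log(p[x_t])         # score x_t first...
        expert.update(x_t)                # ...then reveal it to the cache
    return nats / (len(tokens) * math.log(2))  # nats -> bits
```

Because `expert.update(x_t)` runs only after position t is scored, a cache hit can never depend on the token it is being scored against.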
Runtime
467.78s (std 9.06s)

Test plan
Across the different submissions and reruns I tried, these n-gram cache experts seem relatively consistent and typically give about a
0.003-0.004 bpb boost.
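For reference, the one-sided significance test reported in the Results section can be reproduced with a few lines of stdlib Python. The per-seed gains below are illustrative placeholders, not the actual run values:

```python
import math
import statistics

def one_sided_t(gains, delta0):
    """t statistic for H1: mean(gains) > delta0, with df = n - 1.

    gains  : per-seed improvements (e.g. nats/byte over the prior record).
    delta0 : the margin the improvement must exceed under H0.
    """
    n = len(gains)
    mean = statistics.fmean(gains)
    sd = statistics.stdev(gains)  # sample std dev (ddof = 1)
    t = (mean - delta0) / (sd / math.sqrt(n))
    return t, n - 1

# Illustrative placeholder gains (NOT the actual per-seed numbers):
t, df = one_sided_t([0.0059, 0.0061, 0.0058, 0.0060], 0.005)
```

The p-value then follows from the upper tail of Student's t distribution at the resulting t and df (e.g. via `scipy.stats.t.sf(t, df)`).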